“Stealing fire or stacking knowledge” by machine intelligence to model link prediction in complex networks

Summary Current methodologies to model connectivity in complex networks either rely on network scientists’ intelligence to discover reliable physical rules or use artificial intelligence (AI) that stacks hundreds of inaccurate human-made rules to make a new one that optimally summarizes them together. Here, we provide an accurate and reproducible scientific analysis showing that, contrary to the current belief, stacking more good link prediction rules does not necessarily improve the link prediction performance to nearly optimal as suggested by recent studies. Finally, under the light of our novel results, we discuss the pros and cons of each current state-of-the-art link prediction strategy, concluding that none of the current solutions are what the future might hold for us. Future solutions might require the design and development of next generation “creative” AI that are able to generate and understand complex physical rules for us.


INTRODUCTION
From nests to nets intricate wiring diagrams surround the birth and death of life and we, as humans, wonder what the secret rules behind such elegant network architectures are. If nature adopts connectivity to shape complexity, humans might adopt their intelligence, or an artificial one, to make sense of it. To predict structural complexity in a network, should we act as small Prometheans, exploiting creativity to ''steal the fire from Gods'' and search, as physicists aim, for that accurate but simple, elegant unique rule that makes sense of ''everything''? Or should we ask artificial intelligence (AI) to stack for us hundreds of inaccurate human-made rules to make a new one that optimally summarizes them together? Perhaps, none of these two solutions is what the future holds for us.
Recently, Ghasemian et al. 1 embraced the second option and proposed a stacking model: a machine learning method for link prediction in complex networks based on ensemble meta-learning, using artificial intelligence to create a single meta-model from hundreds of other models. The supervised algorithm learns from data the errors made by the individual predictors and decides how to combine them into a single predictor. 1 Analyzing link prediction results on 550 real-world networks, the study concluded that the proposed stacking model for meta-learning is a state-of-the-art algorithm achieving nearly optimal prediction. 1 However, the scientific community might benefit to interpret the results of stacking meta-learning 1 when it is compared with other state-of-the-art algorithms that offer competitive performance according to computational strategies which largely differ from ensemble metalearning.
An example of a completely different approach is given by the structural predictability method (SPM), 2 an elegant model-free algorithm relying on a theory derived from quantum mechanics that exploits the firstorder perturbation of the graph adjacency matrix to compute the inherent predictability of the network. SPM takes a way opposite to that of stacking: it simply does not assume any model. Studies have confirmed that SPM is among the best-performing state-of-the-art global approaches for topological link prediction. 2,3 Another family of approaches for modeling and inference on networks is based on the statistical mechanics of complex systems. [4][5][6] In particular, a widely developed framework for statistical analysis on networks is the stochastic block model (SBM), also adopted for prediction tasks such as community detection and link The conclusion is that brute-force stacking of numerous algorithms by ensemble meta-learning AI does not perform better than (and is often significantly outperformed by) one simple brain-bioinspired rule such as CH. Also SPM and SBM seem generally to perform better than stacking. When we consider mean performance ( Figure 1) for 3 evaluation measures (precision, AUC-PR, and AUC-mROC) and 6 network domains (18 total comparisons), CHA significantly outperforms stacking in 14/18 comparisons (with median improvement of 64.5%) and stacking never significantly outperforms CHA. If we consider win rate performance (Figure 2), CHA significantly outperforms stacking in 16/18 comparisons (with median improvement of 383%) and stacking significantly outperforms CHA in 1/18 comparisons (with improvement of 23%).
The result of this analysis agrees with the Gö del incompleteness: stacking might be nearly optimal but incomplete, by stacking you cannot squeeze out more than what is in your features and their interaction. To overcome this limitation, we ran comparisons including in the stacking procedure SPM, SBM, and each CH-model adopted in CHA (stacking basic + CH + SPM + SBM, in Figures 1 and 2). In particular, we stacked the scores of 8 CH-models (RA-L2, RA-L3, CH1-L2, CH1-L3, CH2-L2, CH2-L3, CH3-L2, and CH3-L3) and the 8 related CH-SPcorr scores, 11 for a total of 16 CH features included in the stacking algorithm, in addition to 1 SPM feature and 1 SBM feature (see Methods section for details). Differently from the conclusion of Ghasemian et al. that stacking more good methods would improve the performance to nearly optimal, we found that, overall, there is no clear advantage in adding CH, SPM, and SBM to the basic stacking. Indeed, stacking basic and stacking basic + CH + SPM + SBM show comparable performances over different network domains and evaluation measures. However, if we look closer at the results of Ghasemian et al., 1 this is not totally unexpected. Let's consider Table 1 of Ghasemian et al., 1 which reports the link prediction performance of individual algorithms and of stacking models on the 550 real-world networks, evaluated using AUC-ROC, precision, and recall. For precision, which is the most appropriate to evaluate link prediction among the three measures reported (as we will comment later), the best individual predictor is S-NB 13 (spectral method with non-backtracking matrix), with a mean precision of 0.32. On the other side, stacking all the model-based predictors (including S-NB) leads to a mean precision of 0.25. In particular, all the stacking variants reported in Table 1 of Ghasemian et al. 1 (including topological, model-based, and embedding-based predictors) have a mean precision lower than the one of the best individual predictor (S-NB). The main claims of Ghasemian et al. 1 about near optimality are based on evaluations using AUC-ROC, although several recent studies have shown that AUC-ROC is inappropriate for early retrieval problems and for unbalanced prediction tasks, 9,[14][15][16][17][18][19][20] such as link prediction. A recent study of Muscoloni and Cannistraci 21 proposes the AUC-mROC that we report in the main figures of this study (in lieu of AUC-ROC whose values are provided for completeness in Tables 1 and 2 of this study), which adjusts the evaluation issues largely commented in the literature about the AUC-ROC. In conclusion, the precision-based results in Table 1 of Ghasemian et al., 1 as well as the results in our study, are rising concrete concerns about the near optimality of stacking models when the link prediction task is evaluated according to more appropriate evaluation measures.
In Tables 1 and 2, we report the same results as Figures 1 and 2 including AUC-ROC and some additional evaluation measures (AUC-precision, NDCG, and MCC) as well as the mean over the network domains. The rationale to introduce these other evaluation measures is the following. The area under the precision curve (AUC-precision) was adopted in previous studies of network-based prediction of protein interaction in For each network and for each link prediction method, the link prediction evaluation framework is applied (10 repetitions). The barplots report, for each network domain (biological, economic, informational, social, technological, and transportation) and for each evaluation measure (precision, AUC-PR, and AUC-mROC), the mean performance of each method over the networks of that domain and over the 10 repetitions. The arrows highlight the percentage of improvement between CHA and stacking basic as jCHA À stackingj minðCHA;stackingÞ $100. The asterisks represent the p value significance of the permutation test for the mean, comparing the performances of CHA and stacking basic on the networks of that domain: one (*), two (**), or three (***) asterisks depending on whether the p value is lower than or equal to the significance thresholds 0.05, 0.01, or 0.001, respectively. The number of networks of each domain is indicated in brackets. Error bars show the standard error of the mean, and the mean performance is reported on top of them. iScience Article order to overcome the limitations of AUC-ROC deceptiveness in link prediction. 10,22 The normalized discounted cumulative gain (NDCG) was proposed 23,24 for assessment of algorithms in information retrieval of documents with the aim to address the inaptness of AUC-ROC in early retrieval problems such as link prediction. Therefore, as suggested by Tao Zhou in one of his recent reviews, 25 we decided to apply NDCG also in link prediction evaluation. Finally, as sanity check that AUC-ROC evaluation is misleading, we consider a measure of binary classification known as the Matthews correlation coefficient (MCC). 26,27 MCC is a binary classification rate that generates a high score only if the binary predictor was able to correctly predict most of positive data instances and most of negative data instances. 28,29 This means that, differently from AUC-ROC, MCC provides a fair estimate of the predictor performance in class unbalanced datasets such as the one in link prediction problem. However, differently from the other early retrieval measures (precision, AUC-PR, AUC-precision, NDCG, and AUC-mROC) MCC does not give more importance to the positive class and it fairly and balanced considers the position in the ranking of positive and negative (in our case non-observed links) instances.
We can notice that all the evaluation measures except AUC-ROC provide coherent results, highlighting CHA as the method with the best mean value over the network domains (both for mean performance and win rate). AUC-ROC, instead, for some specific domains leads to a different conclusion, and indicates SBM as the best method for mean performance over the domains. The misleading evaluation performance of AUC-ROC that we notice in this large investigation agrees with recent studies reporting AUC-ROC as a deceptive measure for link prediction evaluation. 14,15 However, our finding that MCC-which is designed as AUC-ROC to be a binary classification rate and not an early retrieval measure-disagrees with AUC-ROC and clearly agrees with the other early retrieval measures, provides an incontrovertible evidence that AUC-ROC is unreliable for evaluation of some link prediction scenario. This result is explained by the fact that MCC, as AUC-ROC, neglects the early retrieval nature of the problem but, differently from AUC-ROC, is able to adjust for the class unbalance. To understand better the scenarios in which AUC-ROC works and the ones in which it is misleading, we made another specific study that analyzes the misleading results of AUC-ROC and proposes the AUC-mROC as a new evaluation measure. AUC-mROC can adjust the AUC-ROC for evaluation of early retrieval problems in general and link prediction in particular. 21 For this reason, we prefer to report AUC-mROC in the main figures of this study, and AUC-ROC in the Tables 1  and 2.

DISCUSSION
In conclusion, we might benefit to pursue new advanced strategies that overcome the current ones by means of ''creative'' AI that resembles human-like physical ''understanding'' of simple generalized rules associated with complexity. 30 The methods and ways to design this next generation ''creative'' AI for complexity understanding are still under debate. 31 The future might be populated by AI that ''steals for us the fire from Gods'', toward machine intelligence that creates new rules, and then stack them with the ones already known.  The barplots report, for each network domain (biological, economic, informational, social, technological, and transportation) and for each evaluation measure (precision, AUC-PR, and AUC-mROC), the win rate of each method over the networks that domain and over the 10 repetitions. The win rate is the proportion of networks and repetitions in which the method obtains the best performance among all the methods, according to a certain evaluation measure. The value is shown on top of the bar. The arrows highlight the percentage of improvement between CHA and stacking basic as jCHA À stackingj minðCHA;stackingÞ $ 100. The asterisks represent the p value significance of the permutation test for the win rate, comparing the wins and losses of CHA and stacking basic on the networks of that domain: one (*), two (**), or three (***) asterisks depending on whether the p value is lower than or equal to the significance thresholds 0.05, 0.01, or 0.001, respectively. The number of networks of each domain is indicated in brackets.  iScience Article dynamics. Both SBM and CH-modeling, together with its adaptive implementation called CHA, in part reply to this necessity to explicitly understand the mechanisms and rules of the complex system behind the network structure, but they are incomplete similarly to stacking model. This means that they might fail whenever the set of rules that they encode are not respected by a certain network topology. Furthermore, future studies might consider different types of prediction tasks for evaluation of link prediction. For instance, prediction of links on temporal networks, which means to predict the appearance of links in the future using a previous network structure; or prediction of links on synthetic networks, 32,33 which allows us to investigate how the link predictors change their performance in relation to topological features that can be tuned by parameters in the generative synthetic network model. 32,33

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

OPEN ACCESS
It has been shown that different complex networks can be organized according to different patterns of connectivity, such as L2 or L3, and therefore there is not a unique network automaton model that would be able to effectively describe all of them. For example, while performing link prediction on a social network one might decide to adopt a L2-based method, whereas on a PPI network it would likely be better to choose a L3-based method. The Cannistraci-Hebb Adaptive (CHA) network automaton has been designed to adapt to the network under investigation and to automatically select the model that would likely provide the best prediction. 11 In order to do this, it exploits a particular property of the CH network automata models. Such deterministic rules for link prediction can assign both to observed and non-observed links a score that is comparable, meaning that the scores of observed links are not biased to be higher or lower than the scores of nonobserved links. This is because the mathematical equation to compute the score of the connection between two nodes is independent from the existence of that link, whether the link is observed or not in the current topology does not affect the score.
Given a model, it is possible to assign likelihood scores to both observed and non-observed links and compute the AUC-PR to evaluate how well the model can discriminate them. The assumption is that, if the model tends to score observed links higher than non-observed links, then it would be more effective in predicting missing or future links. Therefore, the adaptive network automaton works as follows: given a network and a set of candidate models, for each model it computes the AUC-PR, then it automatically selects as link prediction result the scores of the non-observed links from the model that obtained the highest AUC-PR.
In this study, we considered the following CH models within the CHA network automaton: CH2-L2, CH3-L2, CH2-L3, CH3-L3. In the original study these models have been highlighted as the best set of models to adopt for building the adaptive version. 11 In addition, each of the four CH models applied the associated CH-SPcorr score for sub-ranking, in order to internally sub-rank all the node pairs characterized by the same CH score, reducing the ranking uncertainty of node pairs that are tied-ranked. 11 The MATLAB code of the CHA method and of the CH models used for the link prediction simulations have been developed by the authors and is publicly available at the GitHub repository associated to this study: https://github.com/ biomedical-cybernetics/stealing_fire_or_stacking_knowledge_to_model_link_prediction.

Structural perturbation method (SPM)
The structural perturbation method (SPM) relies on a theory similar to the first-order perturbation in quantum mechanics. 2 A high-level description of the procedure is the following: (1) randomly remove 10% of the links from the network adjacency matrix X, obtaining a reduced network X' = X -R, where R is the set of removed links; (2) compute the eigenvalues and eigenvectors of X'; (3) considering the set of links R as a perturbation of X 0 , construct the perturbed matrix X P via a first-order approximation that allows the eigenvalues to change while keeping fixed the eigenvectors; (4) repeat steps 1-3 for 10 independent iterations and take the average of the perturbed matrices X P . The link prediction result is given by the values of the average perturbed matrix, which represent the scores for each node pair. The higher the score the greater the likelihood that the interaction exists. The idea behind the method is that a missing part of the network is predictable if it does not significantly change the structural features of the observable part, represented by the eigenvectors of the matrix. If this is the case, the perturbed matrices should be good approximations of the original network. 2 The MATLAB implementation of the SPM method has been provided by the authors of the original study 2 and is publicly available at the GitHub repository associated to this study: https:// github.com/biomedical-cybernetics/stealing_fire_or_stacking_knowledge_to_model_link_prediction.

Stochastic block model (SBM)
The general idea of stochastic block model (SBM) is that the nodes are partitioned into B blocks and a B x B matrix specifies the probabilities of links existing between nodes of each block. SBM provides a general framework for statistical analysis and inference in networks, in particular for community detection and link prediction. 7 The concept of degree-corrected (DC) SBM has been introduced for community detection tasks 35  iScience Article predictive performance is higher when averaging over collections of partitions than when considering only the single most plausible partition, since this can lead to overfitting. 37 Therefore for a given network we sampled P partitions, for each partition we obtained the likelihood scores related to the non-observed links, and then considered the average likelihood scores as the link prediction result. We set p = 100 for networks with N % 100, p = 50 for 100 < N % 1000, p = 10 for N > 1000. The code of the SBM method has been released by the authors of the original study 7 within the Graph-tool Python module available at: https://graph-tool.skewed.de/. The implementation used for the link prediction application is publicly available at the GitHub repository associated to this study: https://github.com/biomedical-cybernetics/ stealing_fire_or_stacking_knowledge_to_model_link_prediction.

Stacking algorithm
Stacking is a meta-learning approach that can learn from data how to combine individual predictors into a single, potentially more accurate algorithm. It combines the predictors by learning a supervised model of input characteristics and the corresponding errors made by individual predictors. 1 The version of the algorithm named ''stacking basic'' in the figures, which corresponds to the implementation released by the authors, contains 42 topological predictors as topological features. In particular, there are 8 network global features, 14 node-pair based features, 20 single-node based features. The complete list is provided in

Link prediction evaluation framework
The 10% link removal evaluation framework is adopted when there is no information available about missing links or links that will appear in the future with respect to the time point of the network under consideration.
Given a network X, 10% of links are randomly removed, obtaining a reduced network X' = X -R, where R is the set of removed links. For evaluating each algorithm, the reduced network X 0 is given in input, obtaining in output a ranking of the non-observed links in X' by decreasing likelihood scores. The prediction performance is evaluated considering as positive samples the set R of links previously removed, according to several evaluation measures: precision, area under precision curve (AUC-precision), area under precision-recall curve (AUC-PR), area under ROC curve (AUC-ROC), area under mROC curve (AUC-mROC), normalized discounted cumulative gain (NDCG), Matthews correlation coefficient (MCC). The reference articles to these evaluation measures are provided in the main text of the study. Due to the randomness of the link removal, the evaluation is repeated 10 times. The MATLAB code of the evaluation measures have been developed by the authors and is publicly available at the GitHub repository associated to this study: https://github.com/biomedical-cybernetics/stealing_fire_or_ stacking_knowledge_to_model_link_prediction.

Real-networks dataset
The dataset consists of the 550 real-world networks adopted by Ghasemian et al., 1  The mean performance evaluation of link prediction is reported in Figure 1 and Table 1. For each network and for each link prediction method, the link prediction evaluation framework is applied (10 repetitions). Then, for each network domain and for each evaluation measure, we compute the mean performance and standard error of the mean of each method over the networks of that domain and over the 10 repetitions. For example, there are 179 networks in the Biological domain, therefore the mean performance and standard error of the mean are computed over 179 3 10 = 1790 values.
The win rate evaluation of link prediction is reported in Figure 2 and Table 2. The procedure is analogous to the one described above, except that the win rate is computed instead of the mean performance. The win rate is defined as the proportion of networks and repetitions (for example 179 3 10 = 1790 in the Biological domain) in which a certain method obtains the best performance among all the methods, according to a certain evaluation measure.
In order to assess if the mean performance of two methods is significantly different, we apply a permutation test of the mean. The results are shown for CHA versus stacking basic in Figure 1. For example, in the Biological domain and considering the precision as evaluation measure, the permutation test is applied on the 1790 precision values of CHA versus the 1790 precision values of stacking basic. The absolute difference of the two means of the two vectors is the observed statistic. For 1000 iterations, the two vectors are randomly permuted, and the same statistic is computed, obtaining a null distribution of statistics. Finally, the empirical p value is computed as the proportion of null statistics that are greater than or equal to the observed statistic.
In order to assess if the win rate of two methods is significantly different, we apply the same permutation test of the mean described above, but for each method the vector containing the performance values (for example of length 1790 in the Biological domain) is transformed into a binary vector representing the wins (value 1) or losses (value 0) of that method among all the methods, according to a certain evaluation measure. In practice, the mean of such binary vector is equal to the win rate of the method.
In Figures 1 and 2, as a result of these statistical tests comparing the performances of CHA versus stacking basic, we show one (*), two (**) or three (***) asterisks depending on whether the p value is lower than or equal to the significance thresholds 0.05, 0.01 or 0.001 respectively.